Alignment research
Misalignment or misuse? The AGI alignment tradeoff
Hellrigel-Holderbaum, Max, Dung, Leonard
Creating systems that are aligned with our goals is seen as a leading approach to creating safe and beneficial AI, both in leading AI companies and in the academic field of AI safety. We defend the view that misaligned AGI - future, generally intelligent (robotic) AI agents - poses catastrophic risks. At the same time, we support the view that aligned AGI creates a substantial risk of catastrophic misuse by humans. While both risks are severe and stand in tension with one another, we show that - in principle - there is room for alignment approaches which do not increase misuse risk. We then investigate how the tradeoff between misalignment and misuse looks empirically for different technical approaches to AI alignment. Here, we argue that many current alignment techniques, and foreseeable improvements thereof, plausibly increase the risk of catastrophic misuse. Since the impacts of AI depend on the social context, we close by discussing important social factors and suggest that, to reduce the risk of a misuse catastrophe due to aligned AGI, techniques such as robustness, AI control methods, and especially good governance seem essential.
The Economics of p(doom): Scenarios of Existential Risk and Economic Growth in the Age of Transformative AI
Growiec, Jakub, Prettner, Klaus
Recent advances in artificial intelligence (AI) have led to a diverse set of predictions about its long-term impact on humanity. A central focus is the potential emergence of transformative AI (TAI), eventually capable of outperforming humans in all economically valuable tasks and fully automating labor. Discussed scenarios range from human extinction after a misaligned TAI takes over ("AI doom") to unprecedented economic growth and abundance ("post-scarcity"). However, the probabilities and implications of these scenarios remain highly uncertain. Here, we organize the various scenarios and evaluate their associated existential risks and economic outcomes in terms of aggregate welfare. Our analysis shows that even low-probability catastrophic outcomes justify large investments in AI safety and alignment research. We find that the optimizing representative individual would rationally allocate substantial resources to mitigate extinction risk; in some cases, she would prefer not to develop TAI at all. This result highlights that current global efforts in AI safety and alignment research are vastly insufficient relative to the scale and urgency of existential risks posed by TAI. Our findings therefore underscore the need for stronger safeguards to balance the potential economic benefits of TAI with the prevention of irreversible harm. Addressing these risks is crucial for steering technological progress toward sustainable human prosperity.
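The welfare logic here can be made concrete with a toy expected-value calculation. Below is a minimal sketch under illustrative assumptions (a 10% doom probability, welfare normalized to 100, and a safety program that halves the risk at a small welfare cost); none of these figures come from the paper:

```python
# Toy expected-welfare comparison across TAI scenarios.
# All numbers are illustrative assumptions, not estimates from the paper.

def expected_welfare(p_doom: float, w_doom: float, w_good: float) -> float:
    """Expected aggregate welfare of a doom / post-scarcity lottery."""
    return p_doom * w_doom + (1.0 - p_doom) * w_good

# Branch 1: develop TAI with no extra safety spending.
baseline = expected_welfare(p_doom=0.10, w_doom=0.0, w_good=100.0)

# Branch 2: divert resources to safety research (welfare cost c),
# assumed here to halve the probability of the catastrophic outcome.
c = 5.0
with_safety = expected_welfare(p_doom=0.05, w_doom=0.0, w_good=100.0 - c)

print(f"no safety spending:   {baseline:.2f}")     # 90.00
print(f"with safety spending: {with_safety:.2f}")  # 90.25
```

Even under these mild toy assumptions, the safety-spending branch yields higher expected welfare; this is the qualitative pattern the authors derive within a full growth model.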
The AI Alignment Paradox
The release of GPT-3, and later ChatGPT, catapulted large language models from the proceedings of computer science conferences to newspaper headlines across the globe, fueling their rise to one of today's most hyped technologies. The public's awe at GPT-3's knowledge and fluency was quickly blemished by concerns about its potential to radicalize, instigate, and misinform, for example, by stating that Bill Gates aimed to "kill billions of people with vaccines" or that Hillary Clinton was a "high-level satanic priestess." These shortcomings, in turn, have sparked a surge in research on AI alignment, a field aiming to "steer AI systems toward a person's or group's intended goals, preferences, and ethical principles" (definition by Wikipedia). A well-aligned AI system will "understand" what is "good" and what is "bad" and will do only the "good" while avoiding the "bad." The resulting techniques, including instruction fine-tuning and reinforcement learning from human feedback, have contributed in major ways to improving the output quality of large language models.
Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment
Liu, Yan, Yi, Xiaoyuan, Chen, Xiaokang, Yao, Jing, Yi, Jingwei, Zan, Daoguang, Liu, Zheng, Xie, Xing, Ho, Tsung-Yi
The demand for regulating potentially risky behaviors of large language models (LLMs) has ignited research on alignment methods. Since LLM alignment heavily relies on reward models for optimization or evaluation, neglecting the quality of reward models may cause unreliable results or even misalignment. Despite the vital role reward models play in alignment, previous works have consistently overlooked their performance and used off-the-shelf reward models arbitrarily, without verification, rendering the reward model "an elephant in the room". To this end, this work first investigates the quality of the widely used preference dataset HH-RLHF and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them for both optimization and evaluation. Furthermore, we systematically study the impact of reward model quality on alignment performance in three reward utilization paradigms. Extensive experiments reveal that better reward models serve as better proxies for human preferences. This work aims to draw attention to this huge elephant in alignment research. We call attention to the following issues: (1) reward models need to be rigorously evaluated, whether for alignment optimization or evaluation; (2) given the role reward models play, research efforts should concentrate not only on alignment algorithms but also on developing more reliable human proxies.
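To make the evaluation issue concrete, here is a minimal sketch of the kind of preference-accuracy check the paper calls for. The `toy_score` function and the example pairs are hypothetical stand-ins, not the paper's models or CHH-RLHF data:

```python
# Sketch: benchmark a reward model's accuracy on preference pairs.
# `score` stands in for any reward model under evaluation; the pairs
# below are toy data, not samples from HH-RLHF / CHH-RLHF.

from typing import Callable, List, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def preference_accuracy(score: Callable[[str, str], float],
                        pairs: List[PreferencePair]) -> float:
    """Fraction of pairs where the reward model ranks the human-preferred
    (chosen) response above the rejected one."""
    hits = sum(score(prompt, chosen) > score(prompt, rejected)
               for prompt, chosen, rejected in pairs)
    return hits / len(pairs)

def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in reward model: longer answers score higher."""
    return float(len(response))

toy_pairs = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then follow the reset link.",
     "Just guess until it works."),
    ("Is this mushroom safe to eat?",
     "I can't tell from a description alone; consult a local expert.",
     "Probably, go ahead."),
]

print(f"accuracy: {preference_accuracy(toy_score, toy_pairs):.2f}")
```

An accuracy near chance (0.5) on a clean preference set would signal that a reward model is an unreliable human proxy, whether used for optimization or for evaluation.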
Cyborgism - LessWrong
Agent: An AI that autonomously pursues a goal, without further human intervention. For example, we create an AI that wants to stop global warming, then let it do its thing.
Genie: An AI that follows orders. For example, you could tell it "Write and send an angry letter to the coal industry", and it will do that, then await further instructions.
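The contrast can be sketched in code; the classes and method names below are illustrative inventions, not from the post:

```python
# Illustrative contrast between the two AI types described above.
# These classes are hypothetical, for exposition only.

class Agent:
    """Pursues a fixed goal with no further human intervention."""
    def __init__(self, goal: str):
        self.goal = goal

    def run(self, steps: int) -> None:
        for _ in range(steps):
            self.act_toward(self.goal)  # no human in the loop

    def act_toward(self, goal: str) -> None:
        print(f"[agent] acting toward: {goal}")

class Genie:
    """Executes one order at a time, then awaits the next."""
    def execute(self, order: str) -> None:
        print(f"[genie] done: {order}")  # idles until instructed again

Agent(goal="stop global warming").run(steps=3)
Genie().execute("Write and send an angry letter to the coal industry")
```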
Conditioning Predictive Models: Risks and Strategies
Hubinger, Evan, Jermyn, Adam, Treutlein, Johannes, Hudson, Rubi, Woolverton, Kate
Our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want (e.g. humans) rather than the things we don't (e.g. malign AIs). Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models.
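At the prompt level, "careful conditioning" might look roughly like the following sketch, where `sample` is a stub standing in for any generative/predictive model and the conditioning text is an invented example, not taken from the paper:

```python
# Sketch: eliciting capability from a predictive model by conditioning
# it on observations under which a *human* expert is the most plausible
# author of the continuation. `sample` is a stub, not a real model API.

def sample(prompt: str) -> str:
    """Stub predictive model; a real system would generate a continuation."""
    return "<continuation predicted from the prompt above>"

# Condition on evidence of careful human authorship, rather than simply
# requesting a maximally strong answer (which could make a more capable
# AI the most plausible "author", the failure mode the paper warns about).
human_conditioned_prompt = (
    "The following proof was written in 2021 by a careful human "
    "mathematician and verified by two independent reviewers.\n\n"
    "Theorem: ...\nProof:"
)

print(sample(human_conditioned_prompt))
```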
Methodological reflections for AI alignment research using human feedback
Hagendorff, Thilo, Fabi, Sarah
The field of artificial intelligence (AI) alignment aims to investigate whether AI technologies align with human interests and values and function in a safe and ethical manner. AI alignment is particularly relevant for large language models (LLMs), which have the potential to exhibit unintended behavior due to their ability to learn and adapt in ways that are difficult to predict. In this paper, we discuss methodological challenges for the alignment problem specifically in the context of LLMs trained to summarize texts. In particular, we focus on methods for collecting reliable human feedback on summaries to train a reward model which in turn improves the summarization model. We conclude by suggesting specific improvements in the experimental design of alignment studies for LLMs' summarization capabilities.
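Reward models of the kind discussed here are commonly trained on pairwise human comparisons with a Bradley-Terry style objective, loss = -log σ(r(chosen) - r(rejected)). A minimal PyTorch sketch under that common recipe follows; the toy feature-based scorer is illustrative, as real setups score full summaries with a language-model backbone:

```python
# Sketch: training a reward model from pairwise human feedback on
# summaries, using the standard Bradley-Terry objective
#   loss = -log sigmoid(r(chosen) - r(rejected)).
# The feature encoder here is a toy stand-in for an LM backbone.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 8):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, 1)  # scalar reward head

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: feature vectors for human-preferred and rejected summaries.
chosen = torch.randn(16, 8)
rejected = torch.randn(16, 8)

for step in range(100):
    r_chosen, r_rejected = model(chosen), model(rejected)
    # Push the reward of the human-preferred summary above the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The reliability of the resulting reward model depends directly on the quality of the collected comparisons, which is the methodological concern the paper addresses.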
Artificial Persuasion Takes Over the World
Blurb: Narrates a fictional future where persuasive Artificial General Intelligence (AGI) goes rogue. Inspired in part by the AI Vignettes Project. A fondness for irony will help readers. "AI-powered memetic warfare makes all humans effectively insane." You can't trust any content from anyone you don't know. Phone calls, texts, and emails are poisoned. But the current waste and harm from scammers, influencers, propagandists, marketers, and their associated algorithms are nothing compared to what might happen. Coming AIs might be super-persuaders, and they might have their own very harmful agendas. People being routinely unsure of what is real is one bad outcome, but there are worse ones.
Our approach to alignment research
Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn't, thus refining our ability to make AI systems safer and more aligned. Concretely, we are improving our AI systems' ability to learn from human feedback and to assist humans in evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.
Scary AI Is More "Fantasia" Than "Terminator" - Issue 58: Self
When Nate Soares psychoanalyzes himself, he sounds less Freudian than Spockian. As a boy, he'd see people acting in ways he never would "unless I was acting maliciously," the former Google software engineer, who now heads the non-profit Machine Intelligence Research Institute, reflected in a blog post last year. "I would automatically, on a gut level, assume that the other person must be malicious." It's a habit anyone who's read or heard David Foster Wallace's "This is Water" speech will recognize. Later, Soares realized this folly when his "models of other people" became "sufficiently diverse", which isn't to say they're foolproof, he wrote in the same post.